Abstract. Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, 
and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and 
maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches 
remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such 
problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emer- 
gence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational 
performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow 
existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a com- 
mon library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, 
to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance 
statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations 
on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently ex- 
ploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions 
and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages 
of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE 
library is free, open-source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. 
An example client program is available as public domain software. [Bayesian phylogenetics; GPU; maximum likelihood; 
parallel computing.] 



Most modern approaches to statistical phylogenetic 
inference involve computing the probability of observed 
character data for a set of taxa given a phylogenetic 
model — often a tree and continuous-time Markov chain 
model of character state evolution. Felsenstein (1981) 
demonstrated an efficient algorithm to calculate this 
probability, which is often referred to as the likelihood 
of the model. His algorithm recursively computes par- 
tial likelihoods via simple sums and products. These 
partial likelihoods track the probability of the observed 
data descended from an internal node conditional on 
a particular state at that internal node. A library that 
implements the calculations required by Felsenstein's 
algorithm is appealing because this procedure accounts 
for the majority of computing time in most likelihood- 
based phylogenetic operations. Furthermore, the algo- 
rithm offers opportunities for parallelization. 
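
The recursion can be illustrated with a minimal Python sketch for a single site. The three-taxon tree, its branch lengths, the Jukes-Cantor model, and all function names here are illustrative assumptions and are not part of BEAGLE.

```python
import math
from itertools import product

STATES = 4  # nucleotides A, C, G, T encoded as 0..3

def jc69_matrix(t):
    """Jukes-Cantor transition probabilities for an edge of length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(STATES)] for i in range(STATES)]

def tip_partials(state):
    """Partial likelihoods at a tip: 1 for the observed state, 0 otherwise."""
    return [1.0 if s == state else 0.0 for s in range(STATES)]

def combine(left, p_left, right, p_right):
    """One pruning step: parent partials from the partials of two children."""
    parent = []
    for x in range(STATES):
        from_left = sum(p_left[x][y] * left[y] for y in range(STATES))
        from_right = sum(p_right[x][y] * right[y] for y in range(STATES))
        parent.append(from_left * from_right)
    return parent

def three_taxon_likelihood(a, b, c):
    """Site likelihood on the assumed tree ((A:0.1, B:0.2):0.05, C:0.3)."""
    inner = combine(tip_partials(a), jc69_matrix(0.1), tip_partials(b), jc69_matrix(0.2))
    root = combine(inner, jc69_matrix(0.05), tip_partials(c), jc69_matrix(0.3))
    # Weight the root partials by the (uniform) equilibrium frequencies.
    return sum(0.25 * p for p in root)

# Sanity check: the likelihoods of all 64 possible site patterns sum to one.
total = sum(three_taxon_likelihood(a, b, c) for a, b, c in product(range(STATES), repeat=3))
```

Each call to `combine` uses only sums and products, which is why the procedure maps well onto parallel hardware.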

In typical phylogenetic models, likelihood calcula- 
tion operations assume independence at several levels. 
These independencies provide the opportunity to per- 
form operations in parallel. For example, models often 
assume that sites in a sequence alignment evolve independently,
so that one can compute the likelihood
for each site separately. The product of site likelihoods 
yields the likelihood for the alignment. In models that 
include among-site rate variation via a finite mixture, 
it is often possible to calculate conditional likelihoods 
given each rate category in parallel. Several other op- 
portunities for parallelism exist at a finer scale. 
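
As a minimal sketch of how these independence assumptions combine, the following Python function takes conditional site likelihoods that are assumed to have been computed already for each rate category. The function name and input layout are hypothetical.

```python
import math

def alignment_log_likelihood(site_category_likelihoods, category_weights):
    """Sum of per-site log likelihoods under a finite mixture of rate categories.

    site_category_likelihoods[i][c] is the likelihood of site i conditional on
    rate category c; category_weights[c] is the mixture weight of category c.
    """
    total = 0.0
    for per_category in site_category_likelihoods:
        # Mixture over categories: each term could be computed in parallel.
        site_likelihood = sum(w * l for w, l in zip(category_weights, per_category))
        # A product over independent sites becomes a sum of logarithms.
        total += math.log(site_likelihood)
    return total

example = alignment_log_likelihood([[0.1, 0.3], [0.2, 0.2]], [0.5, 0.5])
```

Working with a sum of logarithms rather than a raw product also avoids underflow when the number of sites is large.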

We have developed the software library BEAGLE: 
Broad-platform Evolutionary Analysis General Likeli- 
hood Evaluator. BEAGLE provides a uniform interface 
for calculating phylogenetic likelihoods under a vari- 
ety of different phylogenetic models. The library im- 
plements parallelism in the likelihood calculation on 
important emerging computer hardware technology, in- 
cluding graphics processing units (GPUs) and multicore 
central processing units (CPUs). We intend for users to 
install the library as a shared resource to be used by any 
phylogenetic software that supports the library. This 
approach allows developers of phylogenetic software to 
share any optimizations of the core calculations and any 
package that uses BEAGLE will automatically benefit 
from the improvements to the library. For researchers, 
this centralization provides a single installation to 
take advantage of new hardware and parallelization 
techniques. We now describe the interface to the library 
and some details regarding its implementation. 



Application Programming Interface 
Key Concepts 

The key to BEAGLE performance lies in delivering 
fine-scale parallelization while minimizing data transfer
and memory copy overhead. To accomplish this, the 
library has no concept of, or data structure for, a tree, 
despite its intended use in phylogenetic analysis. 
Instead, BEAGLE acts directly on flexibly indexed data 
storage (called buffers) for observed character states and 
partial likelihoods. The client program can set the input 
buffers to reflect the data and can calculate the likeli- 
hood of a particular phylogeny by invoking likelihood 
calculations on the appropriate input and output buffers 
in the correct order. Because of this design simplicity, 
the library can support many different tree inference 
algorithms and likelihood calculation on a variety of 
models. Arbitrary numbers of states can be used, as can 
nonreversible substitution matrices via complex eigen 
decompositions, and mixture models with multiple 
rate categories and/or multiple eigen decompositions. 
Finally, BEAGLE application programming interface 
(API) calls can be asynchronous, allowing the calling 
program to implement other coarse-scale paralleliza- 
tion schemes such as evaluating independent genes or 
running concurrent Markov chains. 
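
The buffer-oriented design can be sketched in a few lines of Python. The class below is a hypothetical stand-in written for illustration only: it is not the BEAGLE API, and the layout of an operation as a tuple of five indices is an assumption.

```python
class BufferStore:
    """Hypothetical index-addressed likelihood engine with no notion of a tree."""

    def __init__(self, buffer_count, matrix_count):
        self.partials = [None] * buffer_count  # per buffer: one state vector per pattern
        self.matrices = [None] * matrix_count  # per slot: a square transition matrix

    def update_partials(self, operations):
        """Apply (dest, child1, matrix1, child2, matrix2) index tuples in order."""
        for dest, c1, m1, c2, m2 in operations:
            p1, p2 = self.matrices[m1], self.matrices[m2]
            n = len(p1)
            self.partials[dest] = [
                [
                    sum(p1[x][y] * a[y] for y in range(n))
                    * sum(p2[x][y] * b[y] for y in range(n))
                    for x in range(n)
                ]
                for a, b in zip(self.partials[c1], self.partials[c2])
            ]

# Two-state toy example: buffers 0-2 hold tips, buffers 3-4 hold internal nodes.
store = BufferStore(buffer_count=5, matrix_count=1)
store.matrices[0] = [[0.9, 0.1], [0.1, 0.9]]
store.partials[0] = [[1.0, 0.0]]
store.partials[1] = [[1.0, 0.0]]
store.partials[2] = [[0.0, 1.0]]
# The client encodes the tree purely through the order and indices of operations.
store.update_partials([(3, 0, 0, 1, 0), (4, 3, 0, 2, 0)])
```

Because the tree exists only in the client's list of operations, the same engine can serve any tree search or sampling strategy.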



Usage 

To use the library, a client program first creates an 
instance of BEAGLE by calling beagleCreateInstance 
(further API method names can be found in the 
documentation distributed with the library); multiple 
instances per client are possible and encouraged. All 
additional functions are called with a reference to this 
instance. The client program can optionally request that 
an instance run on certain hardware (e.g., a GPU) or 
have particular features (e.g., double-precision math). 
Next, the client program must specify the data dimen- 
sions and specify key aspects of the phylogenetic model. 
Character state data are then loaded and can be in the 
form of discrete observed states or partial likelihoods 
for ambiguous characters. The observed data are usually 
unchanging and loaded only once at the start to mini- 
mize memory copy overhead. The character data can be 
compressed into unique "site patterns" and associated 
weights for each. The parameters of the substitution 
process can then be specified, including the equilibrium 
state frequencies, the rates for one or more substitution 
rate categories and their weights, and finally, the eigen 
decomposition for the substitution process. 
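
Compression into unique site patterns can be sketched as follows. This is a client-side illustration in Python, with hypothetical function names, and makes no claim about how any particular program prepares its data.

```python
import math
from collections import Counter

def compress_site_patterns(alignment):
    """Collapse identical alignment columns into unique patterns with counts.

    alignment is a list of equal-length sequences, one per taxon.
    Returns (patterns, weights), with patterns in order of first appearance.
    """
    counts = Counter(zip(*alignment))  # each column becomes a tuple of states
    patterns = list(counts)
    weights = [counts[p] for p in patterns]
    return patterns, weights

def weighted_log_likelihood(pattern_likelihoods, weights):
    """Alignment log likelihood from per-pattern likelihoods and their weights."""
    return sum(w * math.log(l) for l, w in zip(pattern_likelihoods, weights))

patterns, weights = compress_site_patterns(["ACAC", "ACAC", "GTGA"])
```

The likelihood then needs to be evaluated only once per unique pattern, and each pattern's log likelihood is multiplied by its weight.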

In order to calculate the likelihood of a particular tree, 
the client program then specifies a series of integration 
operations that correspond to steps in Felsenstein's algorithm.
Finite-time transition probabilities for each edge 
are either loaded directly, if the model is nondiagonalizable, 
or calculated in parallel from the eigen decomposition
and the specified edge lengths. This is performed 
within BEAGLE's memory space to minimize data 
transfers. A single function call will then request one 
or more integration operations to calculate partial like- 
lihoods over some or all nodes. The operations are 
performed in the order they are provided, typically dic- 
tated by a postorder traversal of the tree topology. The 
client need only specify the nodes for which the partial 
likelihoods need updating, but it is up to the calling 
software to keep track of these dependencies. The final 
step in evaluating the phylogenetic model is done using 
an API call that yields a single log likelihood for the 
model given the data. 
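
The calculation of finite-time transition probabilities from an eigen decomposition can be sketched in Python as P(t) = V exp(Dt) V^-1. The Jukes-Cantor decomposition and the function name below are assumptions chosen so that the result can be checked against a closed form; they are not taken from the library.

```python
import math

# Orthonormal eigenvectors of the Jukes-Cantor rate matrix, stored as columns.
# This particular matrix is symmetric and orthogonal, so it is its own inverse.
V = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
    [0.5, -0.5, -0.5, 0.5],
]
EIGENVALUES = [0.0, -4.0 / 3.0, -4.0 / 3.0, -4.0 / 3.0]

def transition_matrix(eigenvectors, inverse_eigenvectors, eigenvalues, edge_length, rate=1.0):
    """Return P(t) = V exp(D * rate * t) V^-1 for a single edge and rate category."""
    n = len(eigenvalues)
    expd = [math.exp(lam * rate * edge_length) for lam in eigenvalues]
    return [
        [
            sum(eigenvectors[i][k] * expd[k] * inverse_eigenvectors[k][j] for k in range(n))
            for j in range(n)
        ]
        for i in range(n)
    ]

P = transition_matrix(V, V, EIGENVALUES, edge_length=0.1)
```

Every entry of every matrix, for every edge and rate category, is independent of the others, so this step parallelizes naturally.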

Aspects of the BEAGLE API design support both 
maximum likelihood (ML) and Bayesian phylogenetic 
tree inference. For ML inference, API calls can calculate 
first and second derivatives of the likelihood with re- 
spect to the lengths of edges (branches). In both cases, 
BEAGLE provides the ability to cache and reuse pre- 
viously computed partial likelihood results, which can 
yield a tremendous speedup over recomputing the en- 
tire likelihood every time a new phylogenetic model is 
evaluated. 
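
Reuse of cached partial likelihoods depends on the client knowing which nodes are stale. A minimal sketch of that bookkeeping in Python, assuming the client stores the tree as a child-to-parent dictionary (a hypothetical representation), might look like this:

```python
def nodes_to_update(parent, changed_node):
    """Internal nodes whose partials are stale after the edge above
    `changed_node` changes, ordered so that children precede parents."""
    stale = []
    node = parent.get(changed_node)
    while node is not None:
        stale.append(node)
        node = parent.get(node)  # the root has no entry, which ends the walk
    return stale

# Tree ((A, B)n1, C)root stored as child -> parent links.
parent = {"A": "n1", "B": "n1", "n1": "root", "C": "root"}
after_changing_A = nodes_to_update(parent, "A")
after_changing_C = nodes_to_update(parent, "C")
```

Only the nodes on the path from the changed edge to the root need new operations; cached partials for all other nodes remain valid.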



Materials and Methods 

The core BEAGLE library is implemented in C++ with 
C and Java JNI interfaces. BEAGLE uses a runtime mod- 
ule loading system to load hardware-specific plugins 
(shared libraries) when suitable hardware is available. 
Current plugins implement BEAGLE on GPUs using 
CUDA and OpenCL (in development), CPUs with vec- 
tor instructions using Streaming SIMD Extensions (SSE), 
and multicore systems via OpenMP. BEAGLE is avail- 
able for Linux, Mac, and Windows operating systems 
and is packaged with conventional installer methods 
for each. 



GPU Implementation 

The GPU implementation of BEAGLE supports both 
single- and double-precision arithmetic. Single preci- 
sion requires more frequent use of a rescaling scheme 
to avoid underflow but allows BEAGLE to run on a 
greater variety of graphics processors since initial gen- 
erations of such hardware did not include support for 
double-precision math. The GPU does fine-scale par- 
allelization of the likelihood calculation, primarily by 
parallelizing across alignment sites, rate categories, and 
state values. Models such as amino acid (20 states) or 
codon models (64 states), therefore, permit a greater de- 
gree of parallelization than nucleotide models (4 states) 
and also yield the most notable speedups on GPU hard- 
ware (Suchard and Rambaut 2009). The CUDA kernels 
load using the CUDA driver API, which enables them 
to be compiled at runtime and utilize features specific 
to the particular hardware and CUDA version installed. 
Multiple GPUs can be seamlessly utilized simultaneously
via multiple BEAGLE instances. 
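
The idea behind rescaling to avoid underflow can be sketched generically in Python. This is one common scheme, dividing each pattern's partials by their largest entry and accumulating the logarithm of that factor; it is an assumption for illustration and not necessarily the scheme used by the library.

```python
import math

def rescale(partials):
    """Rescale each pattern's state vector by its largest entry.

    Returns the scaled partials and, per pattern, the log of the scale factor
    that must be added back when the log likelihood is computed.
    """
    scaled, log_factors = [], []
    for vec in partials:
        largest = max(vec)
        if largest > 0.0:
            scaled.append([v / largest for v in vec])
            log_factors.append(math.log(largest))
        else:
            # An all-zero vector cannot be rescaled; leave it unchanged.
            scaled.append(list(vec))
            log_factors.append(0.0)
    return scaled, log_factors

scaled, log_factors = rescale([[2e-30, 1e-30, 0.0, 0.0]])
# Site log likelihood with uniform frequencies, restoring the scale factor.
site_log_likelihood = math.log(sum(0.25 * v for v in scaled[0])) + log_factors[0]
```

Scale factors accumulated at several nodes are simply summed on the log scale, so the final log likelihood is unchanged by the rescaling.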



CPU-based Implementations 

In addition to a standard serial CPU implementation, 
BEAGLE includes two other CPU-based implemen- 
tations that exploit parallelism in different ways. An 
SSE implementation in double precision uses vector 
processing extensions present in many CPUs to paral- 
lelize computation across character state values. Single- 
precision SSE vectorization has not been a BEAGLE 
priority as other phylogenetic tools already provide 
this feature (Ronquist and Huelsenbeck 2003; Swofford
2003) and so is not yet available in BEAGLE. The 
OpenMP implementation uses multiple threads to par- 
allelize computation across rate categories. Although 
finer-scale parallelization, equivalent to that achieved 
for GPU devices, could be attempted, it is unlikely to 
yield significant speedups due to the thread synchro- 
nization overhead in the OpenMP model. 



Example 

Program Speedups 

Currently, three popular phylogenetic software 
packages interface with BEAGLE: MrBayes (Ronquist 
and Huelsenbeck 2003) and BEAST (Drummond and 
Rambaut 2007), which use Bayesian inference, and 
GARLI (Zwickl 2006), which uses an ML approach. 
We benchmarked each of these programs to compare 
the speed of their native likelihood calculators to the 
BEAGLE implementations. In order to better exploit 
the parallelism offered by the GPU implementation, we 
used a data set with a large number of alignment sites 
and ran it under both nucleotide and codon models. 
More specifically, the data set used had 15 taxa and 
18,792 nucleotide columns, 8558 of which were unique; 
for the codon model, 6080 of the 6264 site patterns were 
unique. This data set was a subset of a larger arthro- 
pod data set (Regier et al. 2010). We performed these 
benchmarks on a standard desktop PC with a 2.9 GHz 
Intel Core i7-930 CPU and 6 GB of 1.6 GHz DDR3 RAM. 
The PC was equipped with an NVIDIA GTX 580 GPU, 
with 1.5 GB of RAM and 512 processing cores running 
at 1.5 GHz. Figure 1 shows runtime speedups for each 
program when using BEAGLE CPU, SSE, and GPU 
implementations under nucleotide and codon models. 
For the GPU implementation, we also benchmarked 
in single-precision mode. Reported speedups are rel- 
ative to the runtime when using the native sequential 
CPU implementation of each program. We note that 
the GARLI interface with BEAGLE is not fully opti- 
mized. Although we expect that further integration 
work will produce positive results, in our tests, only 
the GPU implementation achieved effective speedups. 
We have thus omitted the GARLI results for the CPU-based 
implementations. 



[Figure 1. Panels: GARLI, MrBayes, BEAST. Horizontal axis: nucleotide model and codon model, each in double and single precision. Vertical axis: speedup. Series: BEAGLE GPU, BEAGLE SSE, BEAGLE CPU.]

FIGURE 1. Performance using the BEAGLE library relative to the 
native sequential CPU implementations of phylogenetic analysis pro- 
grams GARLI, MrBayes, and BEAST. Speedup factors are on a log 
scale. 
For the BEAGLE GPU implementation, we observe 
significant speedups across all programs. The speedups 
are largest under the codon models, as they allow for 
better utilization of the GPU cores. We also observe the 
higher performance cost of double-precision calcula- 
tion on the GPU relative to single precision. Overall, 
the highest speedup is 71-fold, for the BEAGLE GPU 
single-precision implementation when compared with 
the BEAST native implementation, under the codon 
model. 

We note that not every analysis run on a GPU will 
achieve the same speedups we report, and, in some cir- 
cumstances, using the BEAGLE GPU implementation 
may result in a slower overall runtime than using a 
CPU implementation. Several factors affect the relative 
performance. Beyond state-space size and numerical 
precision, the number of unique alignment columns 
and the hardware specifications of the GPU, espe- 
cially numbers of cores and memory bandwidth, are 
important factors. We recommend that users first as- 
sess the relative performance of the GPU implementa- 
tion with their setup by performing short comparative 
runs, which specify a smaller chain length or fewer 
generations. 

Conclusion 

BEAGLE is an API and library for high-performance 
evaluation of phylogenetic likelihoods. The API pro- 
vides a uniform interface for performing calculations on 
an expanding variety of computer hardware platforms 
including GPUs, multicore CPUs, and SSE vectoriza- 
tion. On GPUs, the library provides novel algorithms 
and methods for evaluating likelihoods under arbi- 
trary molecular evolutionary models, harnessing the 
large number of processing cores to efficiently paral- 
lelize calculations. Current results show speedups of 
up to 71-fold on a single GPU over CPU-based likelihood
calculators. BEAGLE is currently integrated with 
three state-of-the-art phylogenetic software packages: 
MrBayes, BEAST, and GARLI, and compatible with 
many more. Forthcoming extensions include OpenCL 
support, single-precision SSE vectorization, improved 
performance for highly partitioned data sets, and addi- 
tional high-level language wrappers, such as Python. 



BEAGLE is freely available from http://beagle-lib.googlecode.com
under the GNU Lesser General Public
License, and new collaborators are welcome. 